* Replication files for The Effect of Drone Strikes on Civilian Communication: Evidence from Yemen
* PSRM

We describe below how we created the data and empirical results for our paper. We first provide a short practical version which describes how we create the results for our paper. We also provide a longer version which describes in more detail the data creation and results (unable to run due to data confidentiality).

***************************************************************************
Short version (replication package with anonymized data).

Main Results:

Figure 1: CalculateStrikeStatistics.py - makes a map of the drone strikes in Yemen using our drone strike locations. Requires a google maps API. (Data not provided)

Figure 2: yemen_call_figure.R - plots over-time call volume in Yemen using our call volume measures where the cell phone towers are anonymized (call_volume_standard_measures_anom.csv).

Figure 3: figure_coef.R - Plots coefficients for our main regression models. Uses our main dataset (final_data_all.csv)

Figure 4: VisualizeCallVolumes.py - Creates the data for our anomaly detection figures. We choose select examples. The corresponding image names in the output are: ds_27.png, ds_32.png, ds_48.png and ds_54.png

Figure 5:  VisualizeCallVolumes.py - Creates the data for our anomaly detection figures. We choose select examples. 

Appendix Results:


Table A1: appendix_extra_results.R - Creates summary statistics. (Data files not included)

Table A2 and Table A4 (columns 1-4): regression_models.R - Creates the main results for the Online Appendix fixed effects analysis (Online Appendix Table 2). Also replicates the results for other measures of call volume.  

Table A3: zero_inflated_negative_binomial.do - Creates the negative binomial results (in Stata). (Data files not included)

Table A4 (columns 5-6): incoming_outgoing_call_volume.R - Replicates our main results using incoming and outgoing call volume (separately). (Data files not included)

Table A5: 24hours_table.R - Replicates our main results using data collapsed for each day (24 hours). (Data files not included)

Table A6: identification_concerns.R - Creates the results for Section "Robustness Checks" in the Online Appendix that highlight additional identification concerns. 

Table A6 (column 5): nobots_table.R - Replicates our results after removing bots from the call volume data. (Online Appendix Table 6, Column 5) (Data files not included)

Figure A1 and Figure A2: OptimizeROC.py: Implements the extensive tests of the performance of the anomaly detection methods. 

Figure A3: CalculateStrikeStatistics.py - generates histograms for the drone strike statistics. 
****************************************************************************
Long version (using all the data)

1) Data Pre-Processing: 

In this section, we describe the pre-processing stages involved in converting the raw call data and drones data to the final data sets that are used for our analyses. 

A) Call Data (4 files): The code is in the Code/anomaly_detection_analysis/call_volume_pre_processing folder.

(i) AggregateVolumeByTower.py: Code that reads in the raw data (in .npy format), aggregates in 5-minute intervals by each cell tower, separated by day. The data is further segmented into incoming / outgoing volume per tower, and by whether it is bot traffic. The data is stored in .csv and in pickle files of python dicts.

(ii) GenerateBaselineCallVolumes.py: Helper functions and code for generating call volumes in 5-minute intervals for each drone strike / event and its 20 baselines. This file contains function findBaselineWeeks() and other helpful functions. 

findBaselineWeeks(): this function identifies for each day in our data set the relevant 20 dates for comparison. We use this information to generate our standardized measures of call volume for both the fixed effects and detection analyses. Note that the code treats Ramadan as a special case and handles both symmetric and asymmetric dates.

(iii) Blacklist.py: Code used to analyze which users in the network are likely to be bots and saves them in files: blacklist_total_str.p and rts.p. These lists were used to separate how data is aggregated in AggregateVolumeByTower.py. The bot detection algorithm is described in Methods C.

(iv) GraphAnalysis.py: Code for transforming raw data into a graph and calculating various graph metrics. We use the network degree in the bot detection algorithm in Blacklist.py. The code in GraphAnalysis.py also calculates and plots other graph statistics (which we are not using in the paper), such as component size distribution, effective diameter, degree clustering coefficient distribution, and friend of friend distribution on the directed graph created using call data during 2012. 

How to run the pre-processing of the call data:
Note: All python code can be run using python2.7, and the geopy package (https://pypi.org/project/geopy/).

Note: Whenever '/PATH/TO/DATA/9AB-anonymized-numpy-final/' is found in the code, please replace with path to the raw data files in numpy format. For data confidentiality reasons, we do not include the raw data in the replication package. 

cd /Yemen_Replication/Code/anomaly_detection_analysis/call_volume_pre_processing/
python GraphAnalysis.py
python Blacklist.py
python AggregateVolumeByTower.py
python GenerateBaselineCallVolumes.py

As a result:
GraphAnalysis.py will create user_degree.p in the data/ folder and 2012_graph.txt in the misc/ folder. user_degree.p is included in the replication materials. 2012_graph.txt is used for network analysis, which is not part of the results in the paper, and we thus omit 2012_graph.txt from the replication materials.

Blacklist.py uses precomputed file rts.p to produce blacklist_total_str.p in the data/ folder. These files are included in the replication materials.

AggregateVolumeByTower.py will populate the data/tower_counts/ folder with .csv and pickle folders of the aggregated tower counts. We omit these files from the replication materials in order to protect data confidentiality.

GenerateBaselineCallVolumes.py will populate ROC/baseline_call_volumes folder with call volumes for each strike/event (including 20 baselines each). These intermediate files are included in the replication materials.

B) Drones Data (2 files): Our data on drone strikes came from two separate sources: the New America Foundation and the Bureau of Investigative Journalism. The code is in the Code/drone_strikes_pre-processing folder.

(i) compare.R: This code merges the two data sets together and creates our drone dataset (drones_combined_final.csv). Key steps include merging the raw files together and manually adding select covariates based on news sources and Google Maps.

(ii) locations_distances_strikes.py: this code identifies the closest towers for each drone strike in our data. Uses the function getTowersFromCoord(). Creates the raw .csv files for towers for different radii (1-100 miles).

Due to data confidentiality, the precise location of each cell phone tower is not included in the replication folder.


2) Fixed effects analysis: 

This section describes how we conducted the fixed effects analysis (panel model results). All the code is in the Code/fixed_effects_analysis folder. Due to data confidentiality, the raw data sets are not included in the replication folder (only the code and final data set - final_data_all.csv).

A) Data creation (8 files): This code aggregates the call data for the analysis.

(i) clean_call_data.R: Prepares the call volume data (the DV in the analysis). Key stages include appending all the processed call data files, removing artificial zeros for towers that did not operate at the beginning or end of the period, and creating the key DVs by standardizing call volume. Creates call_volume_standard_measures.csv. (Data files not included)

(ii) clean_locations_data.R: Prepares the drone data (the treatment variable in the analysis). Key stages include cleaning the drone data and adding tower locations for each radius. Creates treatment_all.csv. (Data files not included)

(iii) create_final_data_merge.R: Creates the final data set (final_data_all.csv). Merges the drone strike data with the call volume data. Anonymizes the tower locations. (Included)

(iv) clean_call_data_incoming.R: Prepares the incoming call volume data. (Data files not included)

(v) clean_call_data_outgoing.R: Prepares the outgoing call volume data. (Data files not included)

(vi) clean_call_data_nobots.R: Prepares the call volume data that removes bots. (Data files not included)

(vii) clean_call_data_24hours.R: Prepares the call volume data for the 24-hour analysis. (Data files not included)

(viii) treatment_data_24hours.R: Prepares the drone data for the 24-hour analysis. (Data files not included)


B) Data Analysis and Tables/Figures (9 files):

(i) yemen_call_figure.R - plots over-time call volume in Yemen. (Figure 2)

(ii) figure_coef.R - Plots coefficients for our main regression models. (Figure 3)

(iii) appendix_extra_results.R - Creates summary statistics (Online Appendix Table 1). (Data files not included)

(iv) regression_models.R - Creates the main results for the Online Appendix fixed effects analysis (Online Appendix Table 2). Also replicates the results for other measures of call volume. (Online Appendix Table 4, Columns 1-4).  

(v) zero_inflated_negative_binomial.do - Creates the negative binomial results (in Stata). (Online Appendix Table 3)

(vi) incoming_outgoing_call_volume.R - Replicates our main results using incoming and outgoing call volume (separately). (Online Appendix Table 4, Columns 5-6) (Data files not included)

(vii) 24hours_table.R - Replicates our main results using data collapsed for each day (24 hours). (Online Appendix Table 5)

(viii) identification_concerns.R - Creates the results for Online Appendix Section "Robustness Checks" that highlight additional identification concerns. (Online Appendix Table 6, Columns 1-4)

(ix) nobots_table.R - Replicates our results after removing bots from the call volume data. (Online Appendix Table 6, Column 5) (Data files not included)

(x) yemen_drone_strikes_map.R - makes a map of the drone strikes in Yemen. Requires a google maps API. In the maps/ folder. (Figure 1)


3) Anomaly Detection Analysis:

This section describes how we conducted the anomaly detection analysis. All the code is in the Code/anomaly_detection_analysis folder. (11 files)

(i) DetectionMethods.py: Code for the three detection methods described in Online Appendix "Description of Anomaly Detection Methods" (Method 1, Method 2, Method 3).

(ii) RunDetectionMethods.py: Code for running detection methods over grid of parameters.

(iii) OptimizeROC.py: Implements the extensive tests of the performance of the anomaly detection methods. These tests are explained in Online Appendix "Testing the Performance of our Anomaly Detection Methods". Key stages include (i) running the detection methods with all possible combinations of selections for the following dimensions: radius, number of consecutive 5-minute intervals, with and without strikes with dropped towers, with and without strikes with call volume lower than ten calls across all 5-minute intervals of the strike day, with and without strikes that happened during Ramadan, with and without strikes that happened during the Abyan offensive, with and without strikes for which we do not know the time of the day, with a conservative, moderate, or no ``collision’’ check; (ii) producing the ROC curve for each possible combination of the above dimensions, and calculating the AUC (area under the ROC curve); (iii) selecting the setting that yields the highest AUC; (iv) selecting points on the ROC curve that attain the best true positive rate subject to a given false positive rate, and finding the detection rate on the entire dataset for these points; (v) plotting Online Appendix Figures 1 and 2.

(iv) VisualizeCallVolumes.py: Plots call volume figures for drone strikes, al-Qaeda attacks, Eid al-Fitr, the 2010 FIFA World Cup final, and the bombing of the Presidential Palace. (Figures 4 and 5, Online Appendix "Contextual Impact")

(v) CalculateStrikeStatistics.py: Given selected parameters, find statistics about the duration and intensity of call volume anomalies for detected strikes. (Online Appendix Figure 3)

(vi) coord_to_towers.py: helper functions for getting the tower id from coordinates. 

(vii) tower_to_coord.py (in misc/ folder): helper functions for getting the coordinates from tower id.

Matlab code for anomaly detection Method 2 in Online Appendix "Description of Anomaly Detection Methods" (Kernel Density Estimation) in anomaly_detection_analysis/ROC/ folder:
(viii) detect_kernel_nsamples_pvals.m: Finds anomalies using kernel density estimation. Reports anomalies’ duration and quantile in the empirical distribution of the baseline samples. 

(ix) kde.m: Matlab function for kernel density estimator (https://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator)

(x) find_consecutive_ones.m: helper function for finding consecutive detected intervals

(xi) run_kernel.m: Runs anomaly detection Method 2 for drone strikes and other events. (Figures 4 and 5)

How to run the anomaly detection analysis:

Pre-processing the data for the density estimation anomaly detection method (Method 2): 
*****THE BELOW IS A PRE-PROCESSING STEP AND CAN BE SKIPPED************ 
Note: this pre-processing step can be skipped, since the replication materials already include the .csv files that this pre-processing produces. That is, Figures 4 and 5, as well as Online Appendix Figures 1 and 2, can be replicated without running run_kernel.m. 
run_kernel.m is only used for pre-processing, and the replication files already include the necessary .csv files to run the visualization of call volumes (Figures 4 and 5) and the selection of detection thresholds (Online Appendix Figures 1 and 2).  

If one wants to replicate the pre-processing itself, then one would need to first delete the respective folders folder_name/matlab_pvals (where folder_name = 'matlab_kernel_pvals/' for drone strikes, and 'detection/alqaeda/', 'detection/bombing/', 'detection/fifa/' for non-strike events); and then follow the below instructions:

In MATLAB, set the working directory to /Yemen_Replication/Code/anomaly_detection_analysis/ROC/.
Run file run_kernel.m 
Note: for run_kernel.m, please edit the variable folder_name, to run analysis for drone strikes, or other events (see commented out folder_name values in file).
The result of running run_kernel.m is that subfolders matlab_kernel_pvals/matlab_pvals, and also detection/alqaeda/matlab_pvals, detection/bombing/matlab_pvals, detection/fifa/matlab_pvals get populated with .csv files. The Python routine detectStrikeKernel (in RunDetectionMethods.py) then operates on these .csv files.
****************************************************************

In the terminal:
cd /Yemen_Replication/Code/anomaly_detection_analysis/
python RunDetectionMethods.py
python OptimizeROC.py

Note: RunDetectionMethods.py will populate the data_conservative/, data_moderate/, data_none/ folders in anomaly_detection_analysis/ROC/. These .csv files give the results of our detection for each duration, radius, and threshold parameter, and are included in the replication materials.

Note: When OptimizeROC.py is run, it will generate ROC curve plots. It will generate the ROC curve for all parameters tested, and will print in order of highest to lowest AUC. Online Appendix Figures 1 and 2 show the ROC curves and the AUC for the setting that results in the highest average AUC across the three detection methods.

How to generate call volume plots (Figures 4 and 5):
cd /Yemen_Replication/Code/anomaly_detection_analysis/
python VisualizeCallVolumes.py

In particular, to produce Figures 4 and 5 the code only uses preprocessed data from 'ROC/baseline_call_volumes/15miles_conservative.csv' and the corresponding non-strike data, such as 'ROC/detection/alqaeda/aq_15miles_conservative.csv'. 


How to generate a histogram of strike statistics (Online Appendix Figure 3):
cd /Yemen_Replication/Code/anomaly_detection_analysis/
python CalculateStrikeStatistics.py

Note: All plots will be saved in the figures/ folder. We only include a small subset of these plots in the paper submission.

Running the anomaly detection analysis using intermediate datasets:
Although the full raw dataset will not be released, RunDetectionMethods.py, VisualizeCallVolumes.py, and CalculateStrikeStatistic.py can be run locally using the intermediate data files that can be found in the anomaly_detection_analysis/ROC/ folder. To run OptimizeROC.py, intermediate datasets of 11G are needed (in anomaly_detection_analysis/data/tower_counts/), which we omit from the replication materials in order to protect data confidentiality.


4) Format of the Raw Data
We omit the files described below, for confidentiality reasons. The notes below are from the official readme file of the dataset, as curated by the SNAP group at Stanford (http://snap.stanford.edu/index.html).

This is the preprocessed dataset with all pre- and post-paid calls and entity
resolution as a set of daily files. The size of the whole compressed dataset
is 185G.  Each file contains the calls that started on that day (these calls
may continue on the next day, but they are represented only once in the file
corresponding to the day they started), sorted by start time.

There are two additional files, one is a tower map from id to integer.  We
mapped the ids because it is faster to process a file that contains only
numbers. The entries "-1" are missing values, and there are no other negative
numbers in the data.

The other file is a phonebook, which contains every number in the dataset,
together with a type with five categories, Cell (Cell Phone), LL (landline),
Intl (International), Short (short code service), and RT (Router). They are
mapped to a region (usually a country or city) when possible, and the last two
columns tell you whether this number is well-formatted (possible), and whether
it is valid (it is valid assigned exchange).

The format of the daily files is:

0. 0 (postpaid) or 1 (prepaid)
1. 1 (Call) or 2 (TXT)
2. Start timestamp (seconds since epoch*)
3. End timestamp (seconds since epoch*)
4. Caller
5. Caller's tower
6. Callee
7. Callee's tower

* epoch is Unix epoch, namely Jan 1, 1970, 12:00 AM.

and the phonebook format is

1. Number
2. Type (Cell, LL, Intl, Short, or RT)
3. Region
4. Possible?
5. Valid?

"-1"s are missing entries

